🐛 skip drain node if DrainingSucceeded condition is already true #11050
Conversation
/area machine
/ok-to-test
I'm not sure if we should skip draining just because it succeeded in the past. What is your use case for this?
@sbueringer In our environment, when I scale down a MachineDeployment, I found it gets stuck at draining the node.
If the node doesn't exist anymore, "drain node" should just skip over the drain.
Can you provide some logs so we can figure out why drainNode is not skipped? Context: cluster-api/internal/controllers/machine/machine_controller.go, lines 640 to 644 at 32083e4
@sbueringer Because the node has not been deleted after the drain succeeded.
After the InfraMachine is deleted, the Machine controller returns at https://github.com/kubernetes-sigs/cluster-api/blob/v1.8.1/internal/controllers/machine/machine_controller.go#L452
And at which point is it failing? |
@sbueringer When we delete a Machine: first Reconcile, then after the InfraMachine is deleted.
It's entirely unclear to me how "drain node failed" happens. We have this case where we just skip the drain if the node doesn't exist: cluster-api/internal/controllers/machine/machine_controller.go, lines 640 to 644 at 32083e4
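The skip-drain check referenced above (machine_controller.go lines 640 to 644) is not quoted in this thread. As a rough illustration of the behavior being discussed, here is a minimal sketch: the names `getNode`, `shouldSkipDrain`, and `errNodeNotFound` are stand-ins invented for this example, not the actual cluster-api code, which uses a Kubernetes client and `apierrors.IsNotFound`.

```go
package main

import (
	"errors"
	"fmt"
)

// errNodeNotFound stands in for a Kubernetes NotFound API error (assumption;
// the real controller checks apierrors.IsNotFound on the client's Get call).
var errNodeNotFound = errors.New("node not found")

// getNode is a hypothetical stand-in for the controller's node lookup.
func getNode(nodeExists bool) error {
	if !nodeExists {
		return errNodeNotFound
	}
	return nil
}

// shouldSkipDrain mirrors the behavior described in the thread: if the Node
// object no longer exists, the controller skips the drain instead of retrying.
func shouldSkipDrain(nodeExists bool) bool {
	if err := getNode(nodeExists); errors.Is(err, errNodeNotFound) {
		return true // node is gone, nothing left to drain
	}
	return false
}

func main() {
	fmt.Println(shouldSkipDrain(false)) // node missing: prints true, drain is skipped
	fmt.Println(shouldSkipDrain(true))  // node present: prints false, drain proceeds
}
```

The point of contention in the thread is that this skip only fires when the Node object is gone, which is not the case here: the node still exists after a successful drain.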
Why do you think the node doesn't exist? The Machine controller has not deleted the node...
I think I was misreading one of your comments. This would be easier if you could provide logs.
@sbueringer Sorry, I tried to reproduce the problem and checked the logs. I found the root cause is that some workloads' tolerations are
Pods are recreated after the drain succeeds.
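For context on why a toleration can cause pods to come back after a successful drain: a blanket toleration (`operator: Exists` with no key) tolerates every taint, including `node.kubernetes.io/unschedulable`, so the scheduler can place evicted pods right back onto the cordoned node. The sketch below uses a simplified `Toleration` struct invented for illustration (the real type is `corev1.Toleration`) and a `tolerates` helper that approximates the Kubernetes matching rules.

```go
package main

import "fmt"

// Toleration is a simplified stand-in for corev1.Toleration (assumption).
type Toleration struct {
	Key      string
	Operator string // "Exists" or "Equal"
	Effect   string // empty means "matches all effects"
}

// tolerates approximates the Kubernetes matching rules: an empty key with
// operator Exists matches all taints, and an empty effect matches all effects.
func tolerates(t Toleration, taintKey, taintEffect string) bool {
	keyMatch := t.Key == taintKey || (t.Operator == "Exists" && t.Key == "")
	effectMatch := t.Effect == "" || t.Effect == taintEffect
	return keyMatch && effectMatch
}

func main() {
	// A blanket toleration tolerates the unschedulable taint that cordoning
	// applies, so evicted pods can be rescheduled onto the drained node.
	blanket := Toleration{Operator: "Exists"}
	fmt.Println(tolerates(blanket, "node.kubernetes.io/unschedulable", "NoSchedule")) // true
}
```

This matches the symptom described above: the drain succeeds, the `DrainingSucceeded` condition is set, and then the pods reappear on the node.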
Wondering if this can be solved with a similar change to this one: #11024 (comment). (Heads up: I'm currently refactoring this whole drain behavior here: #11074. Should be done in a few days.)
I would recommend joining the discussion on this issue: #11024
```diff
@@ -368,7 +368,7 @@ func (r *Reconciler) reconcileDelete(ctx context.Context, cluster *clusterv1.Clu
 	conditions.MarkTrue(m, clusterv1.PreDrainDeleteHookSucceededCondition)

 	// Drain node before deletion and issue a patch in order to make this operation visible to the users.
-	if r.isNodeDrainAllowed(m) {
+	if r.isNodeDrainAllowed(m) && !conditions.IsTrue(m, clusterv1.DrainingSucceededCondition) {
```
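The effect of the changed guard can be sketched as follows. The `Condition`, `Machine`, and `isTrue` definitions below are simplified stand-ins invented for this example; the real code uses `clusterv1.Condition` and `conditions.IsTrue` from cluster-api's `util/conditions` package.

```go
package main

import "fmt"

// Condition is a simplified stand-in for clusterv1.Condition (assumption).
type Condition struct {
	Type   string
	Status string // "True" or "False"
}

// Machine is a simplified stand-in carrying only conditions.
type Machine struct{ Conditions []Condition }

// isTrue mimics the semantics of conditions.IsTrue: the condition must be
// present on the object and have status "True".
func isTrue(m *Machine, condType string) bool {
	for _, c := range m.Conditions {
		if c.Type == condType {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	m := &Machine{Conditions: []Condition{{Type: "DrainingSucceeded", Status: "True"}}}
	drainAllowed := true // stand-in for r.isNodeDrainAllowed(m)

	// With the PR's change, a previously successful drain short-circuits the block:
	if drainAllowed && !isTrue(m, "DrainingSucceeded") {
		fmt.Println("draining node")
	} else {
		fmt.Println("skipping drain")
	}
}
```

The review comments below push back on exactly this short-circuit: once the condition is true it is never re-evaluated, even if pods have been rescheduled onto the node since.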
Is this true? When might we need to perform a second drain?
In my opinion we should not skip the drain just because it completed once.
This change would allow not draining again once all pods were gone, but that is only relevant in cases where Pods are rescheduled. And if one of those Pods has a volume, we are then stuck in wait-for-volume-detach anyway.
/hold
See comments above
@liuxu623 this scenario is now covered by #11241, are you ok to close this?
ok
What this PR does / why we need it:
If the DrainingSucceeded condition is already true, we shouldn't drain the node again.